lib: execute pthread_join only when tid != 0 in flb_stop.#9428
lib: execute pthread_join only when tid != 0 in flb_stop.#9428
Conversation
Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
|
@pwhelan can we get more details on why CentOS and Redhat don't conform ? I cannot find a reference about it. It seems to me that the proper approach would be to get the return of pthread_join(3) and print the errors if any:
a simple flb_errno() can do the job since that is set in errno |
That would be great, but pthread_join crashes on RHEL and CentOS before it can even return when passing the first argument as zero. |
|
@pwhelan what are the steps to make it crash ? |
You need to configure a calyptia fleet with an invalid configuration, ie: with a non-existent plugin: Reproducing it is hard under other circumstances. |
|
can we get a minimal repro ? it seems this needs hot reload with a speacial crafted config, something we can repro with curl ? |
You need to run fluent-bit under the CentOS/8 container, modify the configuration file once it is running and then send it SIGHUP. |
|
closing in favor of #9432 |
Summary
This PR wraps the call to
pthread_joininflb_stopin a conditional fortid != 0.Calling
pthread_joinwithtid== 0 is undefined. Technically it should returnESRCHbut CentOS and RHEL do not seem to conform.The following crash occurs:
Debugging the actual code in
__pthread_timedjoin_exwe can see it is in fact aNULLdereference:Root cause analysis
The value of
tidis derived fromctx->config->workerhere:The call to
pthread_joinchecksctx->status, which is not changed whenctx->config->workeris set:The value of
ctx->statuscan be set to the following values which are defined in flb_lib.h:The places where
ctx->statusis set do not correlate with the value ofctx->config->workerat all:fluent-bit/src/flb_lib.c
Line 168 in f36c956
fluent-bit/src/flb_lib.c
Line 680 in f36c956
fluent-bit/src/flb_lib.c
Line 726 in f36c956
fluent-bit/src/flb_lib.c
Line 732 in f36c956
fluent-bit/src/flb_lib.c
Line 741 in f36c956
In fact we might want to just check the value of
ctx->config->workerand notctx->statusbefore joining it. There is a correlation between both values but I do not see it as being strong enough to depend upon one or the other.The code that calls
pthread_joinbehind that conditional is also meant for extraordinary or erroneous circumstances and therefore being defensive to me does not seem to be a bad policy.References
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.